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Abstract 

Learning  to  rank  is  becoming  an  increasingly 
popular  research  area  in  machine  learning. 
The  ranking  problem  aims  to  induce  an  or¬ 
dering  or  preference  relations  among  a  set  of 
instances  in  the  input  space.  However,  col¬ 
lecting  labeled  data  is  growing  into  a  burden 
in  many  rank  applications  since  labeling  re¬ 
quires  eliciting  the  relative  ordering  over  the 
set  of  alternatives.  In  this  paper,  we  pro¬ 
pose  a  novel  active  learning  framework  for 
SVM-based  and  boosting-based  rank  learn¬ 
ing.  Our  approach  suggests  sampling  based 
on  maximizing  the  estimated  loss  differential 
over  unlabeled  data.  Experimental  results  on 
two  benchmark  corpora  show  that  the  pro¬ 
posed  model  substantially  reduces  the  label¬ 
ing  effort,  and  achieves  superior  performance 
rapidly  with  as  much  as  30%  relative  im¬ 
provement  over  the  margin-based  sampling 
baseline. 


1.  Introduction 

Learning  to  rank  has  recently  drawn  broad  attention 
among  machine  learning  researchers  (Joachims,  2002; 
Freund  et  al.,  2003;  Cao  et  ah,  2006).  The  objec¬ 
tive  of  rank  learning  is  to  induce  a  mapping  (ranking 
function)  from  a  predefined  set  of  instances  to  a  set 
of  partial  (or  total)  orders.  For  instance,  in  recom¬ 
mendation  systems  each  customer  is  represented  with 
a  set  of  features  ranging  from  the  income  level  to  age 
and  her  preference  order  over  a  set  of  products  (e.g. 
movies  in  Netflix) .  The  ranking  task  is  to  learn  a  map¬ 
ping  from  the  feature  space  to  a  set  of  permutations  of 
the  products.  The  applications  include  document  re- 

Appearing  in  Proceedings  of  the  25th  International  Confer¬ 
ence  on  Machine  Learning ,  Helsinki,  Finland,  2008.  Copy¬ 
right  2008  by  the  author(s)/owner(s). 


trieval,  collaborative  filtering,  product  rating,  and  so 
on.  In  this  paper,  we  are  interested  in  IR  applications, 
and  focus  on  document  retrieval.  A  number  of  queries 
are  provided  such  that  each  query  is  associated  with 
an  ordering  of  documents  indicating  the  relevance  of 
each  document  to  the  given  query.  Like  many  other 
ranking  applications,  this  requires  a  human  expert  to 
carefully  examine  the  documents  in  order  to  assign 
relevance-based  permutations.  It  is  often  unrealistic 
to  spend  extensive  human  effort  and  money  for  label¬ 
ing  in  ranking.  Thus,  it  is  crucial  to  design  methods 
that  will  considerably  reduce  the  labeling  effort  with¬ 
out  significantly  sacrificing  ranking  accuracy. 

The  active  learning  paradigm  addresses  this  type  of 
problem.  The  central  idea  is  to  start  with  only  a  small 
amount  of  labeled  examples  and  sequentially  select 
new  examples  to  be  labeled  by  an  oracle.  The  selected 
examples  are  then  added  to  the  training  set.  It  is  clear 
that  labeling  data  in  ranking  requires  a  complete  (or 
partial)  ordering  of  data  whereas  in  classification  la¬ 
beling  considers  only  absolute  class  assignments.  The 
target  domain  of  a  set  of  permutations  is  more  com¬ 
plex  than  that  of  absolute  classes.  Hence,  it  is  even 
more  crucial  to  select  the  most  informative  examples 
to  be  labeled  in  order  to  learn  a  ranking  model  using 
fewer  labeled  examples. 

In  this  paper,  we  propose  a  novel  active  sampling 
framework  for  SVM  rank  learning  (Joachims,  2002), 
or  RankSVM  in  short,  and  RankBoost  (Freund  et  ah, 
2003).  The  proposed  method  considers  the  capacity  of 
an  unlabeled  example  to  update  the  current  model  if 
rank-labeled  and  added  to  the  training  set.  We  show 
that  this  capacity  can  be  defined  as  a  function  that 
estimates  the  error  of  a  ranker  introduced  by  the  ad¬ 
dition  of  a  new  example.  The  capacity  function  takes 
different  forms  in  RankSVM  and  RankBoost  due  to 
different  formulations  of  the  ranking  function.  For 
example  in  the  case  of  RankSVM,  the  ranking  func¬ 
tion  is  defined  via  a  normal  vector  which  is  a  weighted 
sum  of  the  support  vectors  whereas  the  ranking  func- 
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tion  is  a  weighted  sum  of  weak  learners  in  RankBoost. 
However,  in  both  cases,  the  proposed  strategy  selects 
the  samples  which  are  estimated  to  produce  a  faster 
convergence  from  the  current  predicted  ranking  to  the 
true  ranking.  Our  empirical  evaluations  on  two  bench¬ 
mark  corpora  from  topic  distillation  tasks  in  TREC 
competitions  show  a  significant  advantage  favoring  our 
method  against  the  margin-based  sampling  heuristic 
of  (Brinker,  2004;  Yu,  2005)  and  a  random  sampling 
baseline. 

The  rest  of  the  paper  is  organized  as  follows:  Section  2 
provides  a  brief  literature  review  to  the  related  work. 
Section  3  motivates  the  choice  of  the  proposed  active 
learning  framework  and  introduces  two  novel  methods 
for  active  learning  in  the  RankSVM  and  RankBoost 
settings.  In  Section  4,  we  report  the  experimental  re¬ 
sults  and  demonstrate  the  effectiveness  of  our  methods 
on  benchmark  datasets.  Finally,  we  offer  our  conclu¬ 
sions  and  next  steps  in  Section  5. 

2.  Related  Work 

A  number  of  strategies  have  been  proposed  for  active 
learning  in  the  classification  framework.  Some  of  those 
center  around  the  version  space  (Mitchell,  1982)  reduc¬ 
tion  principle  (Tong  &  Roller,  2000):  selecting  unla¬ 
beled  instances  that  limit  the  volume  of  the  version 
space  the  most,  or  equivalently  selecting  the  ones  with 
the  smallest  margin.  Some  of  the  others  adopt  the  idea 
of  reducing  the  generalization  error  (Roy  &  McCallum, 
2001;  Xu  et  ah,  2003;  Nguyen  &  Smeulders,  2004;  Don- 
mez  et  ah,  2007):  the  selection  of  the  unlabeled  data 
that  has  the  highest  affect  on  the  test  error,  i.e.  points 
in  the  maximally  uncertain  and  highly  dense  regions 
of  the  underlying  data  distribution  (Xu  et  ah,  2003; 
Nguyen  &  Smeulders,  2004;  Donmez  et  ah,  2007). 

Unfortunately,  it  is  not  straightforward  to  extend  these 
theoretical  principles  to  ranking  problems.  The  gen¬ 
eralization  power  of  ranking  functions  is  measured  by 
different  evaluation  metrics  than  the  ones  used  for  clas¬ 
sification.  Moreover,  the  classical  performance  metrics 
for  ranking,  such  as  MAP  (Mean  Average  Precision), 
precision  at  the  kth  rank  cut-off,  NDCG  (Normalized 
Discounted  Cumulative  Gain),  etc.,  are  harder  to  di¬ 
rectly  optimize  than  the  classical  loss  functions  for 
classification,  i.e.  log  loss,  0/1  loss,  squared  loss,  etc. 

Recently,  there  have  been  attempts  to  address  the 
challenges  in  active  sampling  for  rank  learning. 
Brinker  (2004)  uses  a  notion  of  the  margin  as  an  ap¬ 
proximation  to  reducing  the  volume  of  the  version 
space.  The  margin  in  the  ranking  scenario  is  defined 
as  the  minimum  difference  of  scores  between  two  in¬ 


stances  assuming  the  ranking  solution  is  a  real- valued 
scoring  function.  Yu  (2005)  adopted  the  same  notion 
of  margin  for  SVM  rank  learning  and  proposed  a  batch 
mode  for  instance  selection  that  minimizes  the  sum  of 
the  rank  score  differences  of  all  data  pairs  within  a 
set  of  samples.  Yu  (2005)  proposed  an  efficient  im¬ 
plementation  which  considers  only  the  rank-adjacent 
pairs  and  showed  that  this  strategy  is  optimal  in  terms 
of  selecting  the  most  ambiguous  set  of  samples  with  re¬ 
spect  to  the  ranking  function.  The  major  drawback  of 
this  margin-based  sampling  method  of  (Brinker,  2004; 
Yu,  2005)  is  that  a  scoring  function  for  ranking  may 
assign  very  similar  scores  to  instances  with  the  same 
rank  label  since  the  ranking  function  does  not  distin¬ 
guish  between  the  relative  order  of  two  relevant  or 
two  non-relevant  examples.  However,  such  instances 
do  not  carry  any  additional  information  for  the  rank 
learner  to  distinguish  between  the  relevant  and  the 
non-relevant  data. 

Another  recent  development  in  active  rank  learning 
is  the  divergence-based  sampling  method  of  (Amini 
et  ah,  2006).  The  proposed  method  selects  the  sam¬ 
ples  at  which  two  different  ranking  functions  maxi¬ 
mally  disagree.  One  of  the  two  functions  is  the  current 
ranking  function  trained  on  the  labeled  data,  and  the 
other  is  a  randomized  function  obtained  by  cross  vali¬ 
dation.  The  divergence-based  strategy  is  effective  only 
when  provided  with  a  sufficiently  large  initial  labeled 
set,  which  is  impractical  for  many  real-world  ranking 
applications,  such  as  document  retrieval. 

3.  Active  Sampling  in  Rank  Learning 

3.1.  Motivation 

This  section  presents  a  novel  method  for  active  learn¬ 
ing  using  RankSVM  and  RankBoost.  Roy  and  McCal¬ 
lum  (2001)  argue  that  an  optimal  active  learner  is  the 
one  that  asks  for  the  labels  of  the  examples  that,  once 
incorporated  into  training,  would  result  in  the  lowest 
expected  error  on  the  test  set.  The  expected  error  on 
the  test  set  can  be  estimated  using  the  posterior  dis¬ 
tribution  Pd{u  |  x )  of  class  labels  estimated  from  the 
training  set  using  some  loss  function  L 

EPd  =  f  L{P(y  |  x),PD(y  \  x))P(x)  (1) 

J  X 

Their  aim  is  then  to  select  the  point  x*  such  that  when 
added  to  the  training  set  with  a  chosen  label  y* ,  the 
classifier  trained  on  the  new  set  {D  +  (x*,y*)}  would 
have  less  error  than  any  other  candidate  x. 

V(x,  y)Ep  <  Ep 

V  Pd  +  (x*,V*)  ~  FD  +  (x,y) 


(2) 
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Since  the  true  label  y*  is  unknown,  the  expectation 
calculation  is  carried  out  by  calculating  the  estimated 
error  for  each  possible  label  y  £  Y,  and  then  taking 
the  average  weighted  by  the  current  learner’s  posterior 
Pd(v  |  x).  The  naive  implementation  of  this  method 
would  be  quite  inefficient  and  almost  intractable  on 
large  datasets.  Roy  and  McCallum  (2001)  address  this 
problem  using  fast  updates  for  a  Naive  Bayes  classifier. 
Although  efficient  re-training  procedures  are  available 
for  some  learners  such  as  SVMs  (Cauwenberglrs  &  Pog- 
gio,  2000),  it  would  still  be  infeasible  for  ranking  tasks, 
especially  considering  the  interactive  nature  of  rank¬ 
ing  systems.  In  this  paper,  we  propose  a  method  to 
estimate  how  likely  the  addition  of  a  new  example  will 
result  in  the  lowest  expected  error  on  the  test  set  with¬ 
out  any  re-training  on  the  enlarged  training  set.  Our 
method  is  based  on  the  likelihood  of  an  example  to 
change  the  current  hypothesis  significantly.  There  are 
a  number  of  reasons  why  we  believe  this  is  a  reasonable 
indicator  for  estimating  that  error: 

•  Adding  a  new  data  point  to  the  labeled  set  can 
only  change  the  error  on  the  test  set  if  it  changes 
the  current  learner. 

•  The  more  significant  that  change,  the  greater 
chance  to  learn  the  true  hypothesis  faster. 

•  We  note  that  a  big  change  in  the  current  hypoth¬ 
esis  might  not  always  lead  to  better  generaliza¬ 
tion.  However,  as  more  data  is  sampled  and  the 
hypothesis  gets  closer  to  the  truth,  it  is  less  likely 
that  a  single  outlier  could  hurt  the  performance 
noticeably. 

In  the  following  sections,  we  briefly  review  the 
RankSVM  and  the  RankBoost  algorithms  and  propose 
a  novel  active  learning  method  for  each. 

3.2.  Preliminaries 

Assume  the  data  is  represented  as  a  set  of  feature  vec¬ 
tors  and  corresponding  labels  (ranks)  y  £Y  = 

{ri,  r2, ...,  rn}  where  n  denotes  the  number  of  ranks. 
We  assume  binary  relevance  in  this  paper,  though  our 
framework  can  be  generalized  to  multi-level  ranking 
scenarios  as  long  as  the  rank  learner  works  on  pair¬ 
wise  preference  relationships,  which  is  the  case  for  the 
majority  of  rank  learning  algorithms.  Features  are  nu¬ 
merical  values  for  attributes  in  the  data.  Assume  fur¬ 
ther  that  there  exists  a  preference  relationship  between 
data  vectors  such  that  yt  >-  yj  denotes  Xi  is  ranked 
higher  than  Xj.  A  perfect  ranking  function  f  £  F 
preserves  the  order  relationships  between  instances: 

Xi  >-  Xj  /(£,:)  >  f(Xj) 


Suppose  we  are  given  a  set  of  instances  D  =  {(xj,  yi)  : 
( Xi,yi )  £  X  x  The  objective  for  rank  learning 

is  to  learn  a  mapping  /  :  X  x  Y  t— >  I  that  minimizes 
a  given  loss  function  on  the  training  data. 

3.3.  SVM  Rank  Learning 

Assume  f  £  F  is  a  linear  function,  i.e.  f{x)  =  (w,x), 
that  satisfies 

Xi  >-  Xj  <t=>  (w,  Xf)  >  (w,  Xj) 

The  SVM  model  targeting  this  problem  can  be  formu¬ 
lated  as  a  Quadratic  Optimization  problem: 

min^INf  +  (3) 

w  Z  z — 

subject  to  (w,  Xj)  >  (w,  Xj)  +  1  —  >  0  Vi,  j 

The  above  optimization  can  be  equivalently  written 
by  re-arranging  the  constraints  and  substituting  the 
trade-off  parameter  C  for  A  =  ^  as  follows: 

K 

minX!  [1~zk{™,4-4)\+  +  M\™\\2  (4) 

w  —  1 

/c=l 

where  [...],  indicates  the  standard  hinge  loss,  x1  —  x2 
is  a  pairwise  difference  vector  whose  label  z  is  positive, 
i.e.,  z  =  +1  if  x1  >-  x2  and  z  =  —1  otherwise.  K  is 
the  total  number  of  such  pairs  in  the  training  set.  Fi¬ 
nally,  a  ranked  list  is  obtained  by  sorting  the  instances 
according  to  the  output  of  the  ranking  function  in  de¬ 
scending  order. 

3.4.  Active  Sampling  for  RankSVM 

Let  us  consider  a  candidate  example  x  £  U,  where 
U  is  the  set  of  unlabeled  examples.  Assume  x  is 
incorporated  into  the  labeled  set  with  a  rank  label 
y  £  Y.  We  denote  the  total  loss  on  the  instance 
pairs  that  include  £  by  a  function  of  x  and  ui ,  i.e. 
D(x,  w)  =  Ylrj= i  [1  —  zj  {w,  Xj  —  a?}]+  where  Jy  is  the 
number  of  examples  in  the  training  set  with  a  differ¬ 
ent  label  than  the  label  y  of  x.  For  instance,  Jy  is 
the  number  of  negative(non-relevant)  examples  in  the 
training  set  if  y  is  assumed  to  be  positive(relevant), 
and  vice  versa.  The  objective  function  to  be  mini¬ 
mized  by  RankSVM  then  becomes: 

min{AIH2  +Y1  [l-Zk(w,xl.  -xl)]+  +D(x,w) | 

(5) 

Assume  w*  is  the  solution  to  the  optimization  in  Equa¬ 
tion  4,  and  it  is  unique.  Burges  and  Crisp  (2000)  show 
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the  necessary  and  sufficient  conditions  for  the  unique¬ 
ness  of  the  SVM  solution.  There  are  only  rare  cases 
where  uniqueness  does  not  hold,  thus  it  is  a  rather 
safe  assumption  to  make.  Since  we  do  not  actually  re¬ 
run  the  optimization  problem  on  the  enlarged  data,  we 
restrict  ourselves  to  the  current  solution(hypothesis) 
w* .  Instead  of  re-optimizing,  we  estimate  the  effect 
of  adding  each  candidate  instance  on  the  training  loss 
using  the  current  solution  to  tell  how  much  incorpo¬ 
rating  x  into  the  labeled  set  is  likely  to  change  the 
current  hypothesis.  First,  let  us  consider  two  cases. 

1.  Assume  w*  =  argmin^  D(w,  x) 

Then,  w*  is  also  the  solution  to  the  optimization 
problem  in  Equation  5,  combining  the  assump¬ 
tion  with  w*  being  the  solution  to  Equation  4. 
That  means,  adding  x  to  the  training  set  would 
not  change  the  current  hypothesis.  From  an  ac¬ 
tive  learning  point  of  view,  this  example  is  useless 
since  the  learning  algorithm  is  indifferent  to  its 
inclusion. 


Algorithm  1  RankBoost 

Input:  initial  data  distribution  D\  over  X  x  X 
for  t  =  1  to  T  do 

Train  a  weak  learner  on  Dt 

Obtain  the  weak  ranking  ht  '■  X  R 

Choose  a  weight  at  €  K.  for  ht 

Dt+i(x 1  x2)  =  Dt(sl  ’s2')exP(-at(h  ds^-Ms2))) 

end  for 


We  substitute  w  in  Equation  7  for  the  current  weight 
vector  w*  to  estimate  how  the  solution  of  Equation  6 
deviates  from  it,  i.e.  ||«;*  —  u?||  =  ||Att?||.  We  can 
now  write  the  magnitude  of  the  total  derivative  as  a 
function  of  x  and  the  rank  label  y  as  follows: 

g{x,  y)  =  ||A«?||  =  ^2  ||AuJj|[  (8) 

3 

Jo  if  Zj  (w* ,  Xj  —  x)  >  1 

~i  \ II  -  Zj{xj  -  a?)||  if  Zj  ( w*,Xj  -  x)  <  1 


2.  Assume  w*  ^  argmin^j  D(w,  x) 

This  is  the  situation  where  the  current  solution 
could  be  different  if  that  example  x  were  incorpo¬ 
rated  into  training.  The  magnitude  of  the  differ¬ 
ence  depends  on  the  magnitude  of  the  deviation 
of  D(w*,x)  from  its  optimal  value,  min^D^w,  x). 

We  now  study  the  second  case  in  more  detail.  Let  w 
be  the  weight  vector  that  minimizes  D(w,x),  i.e.  w  = 
arginin^  D(w,  x).  Then,  as  the  difference  ||tc*  —  ru|| 
increases  it  becomes  less  likely  that  w*  is  optimal  for 
Equation  5.  In  other  words,  the  current  solution  w*  is 
in  most  need  of  updating  in  order  to  compensate  for 
the  loss  on  the  new  pairs.  Let  us  write  w  in  terms  of 
w*  as  follows: 

w  =  w*  —  Aw 

Minimizing  D(w,x)  requires  working  with  the  hinge 
loss,  the  direct  optimization  of  which  is  difficult  due 
to  the  discontinuity  of  the  derivative.  However,  it 
can  still  be  solved  using  a  gradient-descent-type  algo¬ 
rithm1.  Recall  the  objective  function  to  be  minimized: 

Jy 

min  D(w,  x)  =  min  N  [1  —  Zj  (w,  Xj  —  x)}+  (6) 

W  W  ^ — J 

3  =  1 

The  derivative  of  the  above  equation  with  respect  to 
w  at  a  single  point  Xj,  A wj,  is: 

0  if  Zj  (w,  Xj  —  x)  >  1  ^ 

—Zj  (xj  —  x)  if  Zj  (w,  Xj  —  x)  <  1 

1For  a  detailed  discussion  on  solving  SVM  rank  learning 
using  gradient  descent,  see  (Cao  et  ah,  2006). 


g(x,  y)  estimates  how  likely  the  current  hypothesis  is 
to  be  updated  to  minimize  the  loss  introduced  as  a 
result  of  the  addition  of  the  example  x  with  the  rank 
label  y.  Thus,  we  use  this  function  to  estimate  the 
ability  of  each  unlabeled  candidate  example  to  change 
the  current  learner  if  incorporated  into  training.  Since 
the  true  labels  of  the  candidate  examples  are  unknown, 
we  use  the  current  learner  to  estimate  the  true  label 
probabilities.  Then,  we  can  take  the  expectation  of 
g(x,  y)  by  taking  the  weighted  sum  over  the  current 
posterior  P{y  \  x)  for  all  y  G  Y.  Among  all  the  un¬ 
labeled  examples,  we  choose  the  one  with  the  highest 
value  for  that  expectation: 

x*  =  argmax  E  P{y  |  x)g(x,y ) 

Seu  veY 

=  argmax <  P(y  =  1  |  x)g{ f ,  y  =  1)  +  (9) 

xGU  l 

P(y  =  -1  |  x)g(x,y  =  -1)  j 

3.5.  RankBoost  Learning 

RankBoost  is  a  boosting  algorithm  designed  for  rank¬ 
ing  problems.  Like  all  algorithms  in  boosting  fam¬ 
ily,  RankBoost  learns  a  weak  learner  on  each  round, 
and  maintains  a  distribution  Dt  over  the  ranked  pairs, 
X  x  X,  to  emphasize  the  pairs  whose  relative  order  is 
the  hardest  to  learn.  An  outline  of  the  algorithm  is 
given  as  Algorithm  1.  Zt  is  a  normalization  constant, 
and  the  final  ranking  is  a  weighted  sum  of  the  weak 
rankings  H(x)  =  Yht- iatPt{x).  For  more  details  and 
theoretical  discussion  see  (Freund  et  al.,  2003). 
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3.6.  Active  Sampling  for  RankBoost 

This  section  introduces  a  similar  method  for  active 
sampling  for  the  RankBoost  algorithm  (Freund  et  ah, 
2003).  Consider  a  candidate  point  x  €  U  and  assume 
it  is  merged  into  the  training  set  with  rank  label  y  €  Y. 
Unlike  RankSVM,  RankBoost  algorithm  does  not  di¬ 
rectly  operate  with  an  optimization  function.  But  the 
ranking  loss  with  respect  to  the  distribution  at  time  t 
can  be  written  as: 

Y,  Dt{xx ,x2)I{H{x2)  >  Hix1))  (10) 


the  loss  incurred  by  including  this  example.  Note  that 
I(x  >  0)  <  ex  for  \/x  €  R  (Freund  et  ah,  2003).  There¬ 
fore,  the  upper  bound  on  A L  can  be  written  as: 


A  L(x,y 


^  exp(2(H(xJ)  -  H(x))) 

’  -  2^  rr.  z. 


(14) 


A L(x,y  =  —1)  can  be  similarly  bounded,  e.g. 

a L(x,y  =  -l)  <  NoWi  the 

loss  difference  can  be  estimated  by  taking  the  expec¬ 
tation  over  the  possible  rank  labels  of  x  with  respect 
to  the  current  ranker’s  posterior,  P(y  \  x): 


where  /  is  defined  to  be  1  if  the  predicate  holds  and  0 
otherwise.  Hence,  this  is  a  sum  over  misranked  pairs, 
assuming  a;1  >-  x2.  The  distribution  at  time  T  +  1  can 
be  written  as: 


Dr+iix1  ,x2) 


Di(xx,  x2) 


exp(H(x2)  —  H (a:1)) 

WZt 


(H) 

The  initial  distribution  term  D\  can  be  dropped  with¬ 
out  loss  of  generality,  assuming  it  is  uniform  (which  is 
reasonable  given  the  fact  that  we  do  not  have  prior  in¬ 
formation  about  the  data).  Similarly  to  RankSVM,  we 
would  like  to  estimate  how  much  the  current  ranking 
function  would  change  if  the  point  x  were  in  the  train¬ 
ing  set.  We  estimate  this  deviation  by  the  difference 
in  the  ranking  loss  after  enlarging  the  current  labeled 
set  with  each  example  x  €  U.  The  ranking  loss  on  the 
enlarged  set  with  respect  to  the  distribution  Dt+ i  is: 


E 

X1 , X * 


exp(H(x2)  —  H(  x1)) 

Tizt 


I(H(x2)  >  H(x1))+ 


Ep(AL(x))  =  P(y  =  1  |  x)AL(x,y  =  1)+ 

P(y  =  -1  I  x)AL(x,  y  =  -1)  (15) 

Note  the  similarity  with  Equation  9  in  the  SVM 
case.  Finally,  we  select  the  instance  x  that  has 
the  highest  expected  loss  differential,  e.g.  x*  = 
argmax^  Ep(AL(x)).  For  notational  clarity,  we  take 
the  maximum  over  the  upper  bound  in  Equation  14  as 
follows: 

x*  =  argmaxl  P(y  =  1  j  a?)(V^  exp{2{H{xp—H{x)))) 
■‘ur  1 

+  P(y  =  1  |  x)(Y  exp(2(H(x)  H(xm))))\ 

X,Xm  ' 

(16) 

For  simplicity,  we  leave  out  the  normalization  constant 
J([(  Zt  since  we  are  interested  in  the  relative  expecta¬ 
tion  rather  than  the  absolute  expectation. 


(12) 


XJ  ,X 


n  tzt 


Note  that  the  rank  label  y  of  x  is  assumed  to  be  pos¬ 
itive  (relevant)  with  x  >-  Xj  in  this  case.  We  have  a 
similar  calculation  for  the  case  where  y  is  assumed  to 
be  negative  (non- relevant).  We  adopt  the  distribution 
Dt+i  because  1)  it  can  easily  be  written  in  terms  of 
the  final  ranking  function,  2)  it  contains  information 
about  which  pairs  remain  the  hardest  to  determine  af¬ 
ter  the  iterative  weight  updates.  Then,  the  difference 
in  the  ranking  loss  between  the  current  and  the  aug¬ 
mented  set  simply  becomes: 

&L(x,y  =  1)  =  £  ^(H(g)-H(f))f(H(_,)  ^  H{3)) 

(13) 

This  difference  indicates  how  much  the  current  rank¬ 
ing  function  needs  to  be  modified  to  compensate  for 


3.7.  Final  Selection 

The  sample  selection  in  both  RankSVM  and  Rank- 
Boost  requires  estimating  a  posterior  label  distribu¬ 
tion.  We  adopt  a  sigmoid  function  to  estimate  that 
posterior  in  the  SVM  case,  as  suggested  by  (Platt, 
1999): 


P(y  |  x)  = 


1 


1  +  exp(—y  *  f(x)  +  C) 


where  f(x)  is  the  real- valued  score  of  the  ranking  al¬ 
gorithm,  and  C  is  a  constant  for  calibrating  the  esti¬ 
mate.  C  is  tuned  on  a  separate  corpus  not  used  for 
evaluation  in  this  paper.  The  final  ranking  in  Rank- 
Boost  is  a  sum  of  weak  learners  with  the  corresponding 
weights.  When  the  weights  are  too  small  (or  too  large), 
the  posterior  gets  close  to  the  extreme  (either  0  or  1) 
regardless  of  the  example.  Hence,  we  normalize  the 
RankBoost  output  dividing  by  the  maximum  possible 
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Figure  1.  Comparison  of  different  active  learners  on  TREC  2003.  The  horizontal  line  indicates  the  performance  when  the 
entire  training  data  is  used.  Only  ~  15%  of  the  training  data  is  actively  labeled  in  total  by  each  method. 


rank  score  without  changing  the  rank  order: 


Hy  I  x) 


1 

1  +  exp{-y  *  N/.  x)  -  +  C) 
Z^t=l  1 


Note  maxgH(x)  =  maxsY^t=iatht(x)  =  J2t=iat 
since  the  weak  learner  ht(x)  in  RankBoost  is  a  {0,1}- 
valued  function  defined  on  the  ordering  information 
provided  by  the  corresponding  feature  (Freund  et  ah, 
2003). 


4.  Evaluation 

4.1.  Data  and  Settings 

We  used  two  datasets  in  the  experiments:  TREC  2003 
and  2004  topic  distillation  tasks  in  LETOR  (Liu  et  ah, 
2007).  The  topic  distillation  task  in  TREC  is  very  sim¬ 
ilar  to  web  search  where  a  page  is  considered  relevant 
to  a  query  if  it  is  an  entry  page  of  some  web  site  rele¬ 
vant  to  the  query.  The  relevance  judgments  on  the  web 
pages  with  respect  to  the  queries  are  binary.  There  are 
44  features,  e.g.  content  and  hyperlink  features,  each 
of  which  is  extracted  from  each  document-query  pair 
and  normalized  into  [0, 1].  There  are  50  and  75  queries 
with  1%  and  0.6%  relevant  documents  in  TREC03 
and  TREC04,  respectively.  The  total  number  of  docu¬ 
ments  per  query  is  ~  1000  for  both  datasets.  We  used 


the  standard  train/test  splits  over  5  folds  in  LETOR. 
For  each  fold,  we  randomly  picked  16  documents  in¬ 
cluding  exactly  one  relevant  document  per  query  for 
initial  labeling.  The  remaining  training  data  is  consid¬ 
ered  as  the  unlabeled  set.  We  compared  our  method 
with  the  margin-based  sampling  of  (Brinker,  2004;  Yu, 
2005)  and  random  sampling  baselines.  Each  method 
selects  5  documents  per  query  for  labeling  at  each 
round,  e.g.  our  method  selects  the  top  5  documents 
according  to  the  criteria  in  Equation  9  and  16.  Then, 
the  ranking  function  is  re-trained,  and  evaluated  on 
the  test  set.  This  process  is  repeated  for  25  iterations 
which  corresponds  to  labeling  only  ~  15%  of  the  entire 
training  data.  The  reported  results  are  averaged  over 
5  folds. 

We  adopted  two  standard,  widely  used  performance 
metrics  for  evaluation,  namely  the  Mean  Average 
Precision  (MAP)  and  the  Normalized  Discounted 
Cumulative  Gain  (NDCG)  (Jarvelin  &  Kekalainen, 
2002).  For  a  single  query,  average  precision  is  de¬ 
fined  as  the  average  of  the  precision  as  computed 
at  each  rank  for  all  relevant  documents:  AP  = 

c„\\ 

where  r  is  the  rank,  N 


Ef=1(P(r)*rei(r)) 


#  relevant  documents  for  this  query 

is  the  number  of  documents  retrieved,  rel()  is  a  bi¬ 
nary  function  on  the  relevance  of  a  given  rank,  and 

P(r)  =  #  relevant  docs  in  top  r  results  ^  ^  predsion  at 
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Figure  2.  Comparison  of  different  active  learners  on  TREC  2004.  The  horizontal  line  indicates  the  performance  when  the 
entire  training  data  is  used.  Only  ~  15%  of  the  training  data  is  actively  labeled  in  total  by  each  method. 


the  rank  cut-off  r.  MAP  is  obtained  by  averaging  the 
AP  values  for  all  queries.  NDCG  is  cumulative  and  dis¬ 
counted  since  the  overall  utility  of  a  list  is  measured 
by  the  sum  of  the  gain  of  each  relevant  document,  but 
the  gain  is  discounted  as  a  function  of  rank  position. 
The  NDCG  value  of  a  rank  list  at  position  n  is  given 
as  follows:  NDCG@n  =  Zn  J2]=i  TFgJFfj)  where  r(j) 
is  the  rank  of  the  jth  document  in  the  list,  and  Zn  is 
the  normalization  constant  so  that  a  perfect  ranking 
yields  an  NDCG  score  of  1. 

4.2.  Results 

Figure  1  and  2  plot  the  performance  of  the  pro¬ 
posed  method  (denoted  by  DiffLoss),  and  as  compara¬ 
tive  baselines,  the  margin-based  sampling  and  random 
sampling  strategies  on  TREC  2003  and  2004  datasets. 
DiffLoss  has  a  clear  advantage  over  margin-based  and 
random  sampling  in  all  cases  with  respect  to  differ¬ 
ent  evaluation  metrics.  The  differences  over  the  en¬ 
tire  operating  range  are  also  statistically  significant 
( p  <  0.0001)  according  to  a  two-sided  paired  t-test  at 
95%  confidence  level.  DiffLoss  especially  achieves  30% 
relative  improvement  over  the  margin-based  sampling 
for  RankSVM  on  TREC  2003  dataset. 

The  horizontal  line  in  each  figure  indicates  the  perfor¬ 


mance  if  all  the  training  data  was  used,  which  we  call 
the  “optimal”  performance.  The  performance  of  Dif¬ 
fLoss  for  RankBoost  is  comparable  to  the  “optimal”  on 
TREC  2003  and  2004  datasets.  In  case  of  RankSVM, 
DiffLoss  is  close  to  the  “optimal”  on  TREC  2003,  and 
outperforms  it  on  TREC  2004  dataset.  More  precisely, 
DiffLoss  using  RankSVM  reaches  the  optimal  perfor¬ 
mance  (even  surpassing  it  on  TREC  2004)  after  10 
rounds  of  labeling  on  average  (labeling  5  documents 
per  query  at  each  round).  DiffLoss  using  RankBoost, 
on  the  other  hand,  reaches  95%  and  90%  of  the  optimal 
performance  on  MAP  and  NDCG@10,  respectively  on 
TREC  2004  dataset  after  10  rounds.  This  suggests 
that  carefully  chosen  samples  might  lead  to  a  higher 
level  of  accuracy  than  blindly  using  large  amounts  of 
training  data.  This  is  an  important  development  over 
traditional  supervised  rank  learning  since  it  not  only 
reduces  the  expensive  labeling  effort,  but  also  may  lead 
to  greater  generalization  power.  As  follow-up  work,  we 
intend  to  explore  methods  that  will  automatically  tell 
the  sampling  algorithm  when  to  stop  so  that  maximum 
gain  with  minimum  cost  is  obtained,  as  well  as  explor¬ 
ing  the  underlying  criteria  for  measuring  the  quality 
of  actively  selected  examples. 

We  conducted  another  set  of  experiments  to  test  the 
hypothesis  that  selecting  a  diverse  set  of  samples  might 
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lead  to  better  results.  We  adopted  the  maximal 
marginal  relevance  principle  of  (Carbonell  &  Gold¬ 
stein,  1998),  originally  proposed  for  text  summariza¬ 
tion.  The  idea  is  to  select  samples  for  labeling  such 
that  they  have  both  the  maximum  potential  to  change 
the  current  ranking  function  and  are  maximally  dis¬ 
similar  to  each  other.  See  (Carbonell  &  Goldstein, 
1998)  for  more  details.  However,  incorporating  this  di¬ 
versity  principle  into  our  selection  criteria  only  slightly 
improved  our  results  at  the  very  beginning  of  the  learn¬ 
ing  curve,  but  the  improvement  vanished  afterwards. 
Thus,  we  do  not  report  these  results  here  in  this  paper. 

5.  Conclusion 

We  proposed  two  novel  active  sampling  methods  based 
on  SVM  rank  learning  and  RankBoost.  Our  frame¬ 
work  relies  on  the  estimated  risk  of  the  ranking  func¬ 
tion  on  the  labeled  set  after  adding  a  new  instance  with 
all  possible  labels.  The  samples  with  the  largest  ex¬ 
pected  risk(loss)  differential  are  selected  to  maximize 
the  degree  of  learning  at  the  fastest  rate.  Empirical 
results  on  two  standard  test  collections  indicate  that 
our  method  significantly  reduces  the  required  number 
of  labeled  examples  to  learn  an  accurate  ranking  func¬ 
tion.  Possible  extensions  of  this  work  include  a  study 
of  the  risk  minimization  in  terms  of  direct  optimization 
of  ranking  performance  metrics,  such  as  MAP,  NDCG, 
precision@k,  etc.  and  self-regulating  algorithms  that 
can  decide  when  to  terminate. 

Acknowledgments 

This  material  is  based  in  part  upon  work  supported 
by  the  Defense  Advanced  Projects  Research  Agency 
(DARPA)  under  Contract  No.  FA8750-07-D-0185. 

References 

Amini,  M.,  Usunier,  N.,  Laviolette,  F.,  Lacasse,  A.,  & 
Gallinari,  P.  (2006).  A  selective  sampling  strategy 
for  label  ranking.  ECML  ’06  (pp.  18-29). 

Brinker,  K.  (2004).  Active  learning  of  label  ranking 
functions.  ICML  '04  (pp.  17-24). 

Burges,  C.,  &  Crisp,  D.  (2000).  Uniqueness  of  the  svm 
solution.  NIPS  ’00  (pp.  223-229). 

Cao,  Y.,  Xu,  J.,  Liu,  T.-Y.,  Li,  H.,  Huang,  Y.,  &  Hon, 
H.-W.  (2006).  Adapting  ranking  svm  to  document 
retrieval.  Proceedings  of  the  international  ACM  SI- 
GIR  Conference  on  Research  and  Development  in 
information  retrieval  (SIGIR'06)  (pp.  186-193). 

Carbonell,  J.,  &  Goldstein,  J.  (1998).  The  use  of  mmr, 


diversity-based  reranking  for  reordering  documents 
and  producing  summaries.  SIGIR  '98  (pp.  335-336). 

Cauwenberghs,  G.,  &  Poggio,  T.  (2000).  Incremental 
and  decremental  support  vector  machine  learning. 
NIPS  ’00  (pp.  409-415). 

Donmez,  P.,  Carbonell,  J.,  &  Bennett,  P.  (2007).  Dual 
strategy  active  learning.  Proceedings  of  the  Euro¬ 
pean  Conference  on  Machine  Learning  (pp.  116- 
127). 

Freund,  Y.,  Iyer,  R.,  Scliapire,  R.,  &  Singer,  Y.  (2003). 
An  efficient  boosting  algorithm  for  combining  pref¬ 
erences.  Journal  of  Machine  Learning  Research,  f, 
933-969. 

Jarvelin,  K.,  &  Kekalainen,  J.  (2002).  Cumulated  gain- 
based  evaluation  of  ir  techniques.  A  CM  Transaction 
on  Information  Systems,  20(f),  422-446. 

Joachims,  T.  (2002).  Optimizing  search  engines  using 
clickthrough  data.  Proceedings  of  ACM  SIGKDD 
International  Conference  on  Knowledge  Discovery 
and  Data  Mining  (KDD'02). 

Liu,  T.,  Xu,  J.,  Qin,  T.,  Xiong,  W.,  Wang,  T.,  &  Li, 
H.  (2007).  Letor:  Benchmark  dataset  for  research 
on  learning  to  rank  for  information  retrieval.  SIGIR 
’07  Workshop:  Learning  to  Rank  for  IR. 

Mitchell,  T.  (1982).  Generalization  as  search.  Journal 
of  Artificial  Intelligence,  18,  203-226. 

Nguyen,  H.,  &  Smeulders,  A.  (2004).  Active  learning 
with  pre-clustering.  ICML  ’Of  (pp.  623-630). 

Platt,  J.  (1999).  Probabilistic  outputs  for  support  vec¬ 
tor  machines  and  comparisons  to  regularized  likeli¬ 
hood  methods.  Advances  in  Large  Margin  Classi¬ 
fiers,  61-74. 

Roy,  N.,  &  McCallum,  A.  (2001).  Toward  optimal 
active  learning  through  sampling  estimation  of  error 
reduction.  ICML  ’01  (pp.  441-448). 

Tong,  S.,  &  Roller,  D.  (2000).  Support  vector  ma¬ 
chine  active  learning  with  applications  to  text  clas¬ 
sification.  Proceedings  of  International  Conference 
on  Machine  Learning  (pp.  999-1006). 

Xu,  Z.,  Yu,  K.,  Tresp,  V.,  Xu,  X.,  &  Wang,  J.  (2003). 
Representative  sampling  for  text  classification  using 
support  vector  machines.  ECIR  ’03. 

Yu,  H.  (2005).  Svm  selective  sampling  for  ranking  with 
application  to  data  retrieval.  SIGKDD  ’05  (pp.  354- 
363). 


